Mastering Data Visualization with R

Author

Martin Schweinberger

Welcome!

What You’ll Learn

By the end of this tutorial, you will be able to:

  • Choose the right visualization type for your data and research question
  • Create publication-quality plots using ggplot2
  • Customize visualizations to tell compelling data stories
  • Apply best practices for effective data communication
  • Build complex, multi-layered visualizations step-by-step

Who This Tutorial Is For

This tutorial is designed for:

  • Beginners who want to learn data visualization from scratch
  • Intermediate R users looking to enhance their plotting skills
  • Researchers who need to create professional visualizations for publications
  • Anyone interested in telling stories with data

Prerequisites

Tutorial Structure

This tutorial follows a learn-by-doing approach with three main components:

  1. Concept explanations - Understanding when and why to use each visualization
  2. Step-by-step examples - Building plots from simple to complex
  3. Hands-on exercises - Practice what you’ve learned immediately
Learning Philosophy

Rather than showing you every possible option at once, we’ll build complexity gradually. Each section introduces new concepts that build on what you’ve learned before.

Setup and Preparation

Installing Required Packages

First, let’s install all the packages we’ll need. Run this code once - it may take 3-5 minutes:

Code
# Install core packages
install.packages("dplyr")      # Data manipulation
install.packages("stringr")    # String processing
install.packages("ggplot2")    # Core plotting package
install.packages("tidyr")      # Data reshaping
install.packages("scales")     # Scale functions for ggplot2

# Install specialized plotting packages
install.packages("ggridges")   # Ridge plots
install.packages("ggstats")    # Statistical plots
install.packages("ggstatsplot")# Statistical visualizations
install.packages("EnvStats")   # Environmental statistics

# Install packages for specific plot types
install.packages("likert")     # Likert scale visualizations
install.packages("vcd")        # Categorical data visualization
install.packages("hexbin")     # Hexagonal binning
install.packages("gridExtra")  # Arranging multiple plots
install.packages("quanteda")   # Text processing (used in Part 4)
install.packages("quanteda.textplots")  # Word clouds (used in Part 4)

# Install utility packages
install.packages("flextable")  # Pretty tables
install.packages("devtools")   # For installing from GitHub

# Install ggflags from GitHub (for country flags in plots)
devtools::install_github("jimjam-slam/ggflags")

Loading Packages

Now activate the packages for this session:

Code
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(flextable)
library(hexbin)
library(gridExtra)
library(ggflags)
library(ggstats)
library(ggridges)
library(EnvStats)
library(scales)
Pro Tip

Create a standard R script with these library calls that you can run at the start of each data visualization session!
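For example, such a script (here called setup.R, a name of our own choosing) could look like this:

Code

```r
# setup.R -- packages used throughout this tutorial
library(dplyr)      # data manipulation
library(stringr)    # string processing
library(ggplot2)    # plotting
library(tidyr)      # reshaping
library(scales)     # axis and scale formatting

# Then start each session or analysis script with:
# source("setup.R")
```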

Loading the Data

We’ll work with a dataset about preposition usage in historical English texts:

Code
# Load data
pdat <- base::readRDS("tutorials/dviz/data/pvd.rda")

Let’s examine the structure of our data:

Date   Genre           Text         Prepositions   Region   GenreRedux       DateRedux
1736   Science         albin        166.01         North    NonFiction       1700-1799
1711   Education       anon         139.86         North    NonFiction       1700-1799
1808   PrivateLetter   austen       130.78         North    Conversational   1800-1913
1878   Education       bain         151.29         North    NonFiction       1800-1913
1743   Education       barclay      145.72         North    NonFiction       1700-1799
1908   Education       benson       120.77         North    NonFiction       1800-1913
1906   Diary           benson       119.17         North    Conversational   1800-1913
1897   Philosophy      boethja      132.96         North    NonFiction       1800-1913
1785   Philosophy      boethri      130.49         North    NonFiction       1700-1799
1776   Diary           boswell      135.94         North    Conversational   1700-1799
1905   Travel          bradley      154.20         North    NonFiction       1800-1913
1711   Education       brightland   149.14         North    NonFiction       1700-1799
1762   Sermon          burton       159.71         North    Religious        1700-1799
1726   Sermon          butler       157.49         North    Religious        1700-1799
1835   PrivateLetter   carlyle      124.16         North    Conversational   1800-1913

Understanding Our Data

Our dataset contains:

  • Date: When the text was written
  • Genre: Type of text (Fiction, Legal, Religious, etc.)
  • Text: Name of the source text
  • Prepositions: Relative frequency of prepositions (per 1,000 words)
  • Region: Geographic location (North/South)
  • GenreRedux: Simplified genre categories
  • DateRedux: Time periods (1150-1499, 1500-1599, etc.)

Setting Up a Color Palette

Let’s create a consistent color scheme for our visualizations:

Code
# Define custom colors
clrs <- c("purple", "gray80", "lightblue", "orange", "gray30")
Why Custom Colors?

Using a consistent color palette across all your visualizations:
- Creates a professional, cohesive look
- Makes your work more recognizable
- Ensures color accessibility
- Saves time (no need to specify colors each time)

Explore more color options:
- R Color Reference
- R Color Palettes
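The time savings become concrete if you wrap the palette in small helper functions. The names scale_color_tutorial() and scale_fill_tutorial() below are our own invention, not part of ggplot2:

Code

```r
library(ggplot2)

clrs <- c("purple", "gray80", "lightblue", "orange", "gray30")

# Hypothetical helpers that apply the tutorial palette in one call
scale_color_tutorial <- function(...) scale_color_manual(values = clrs, ...)
scale_fill_tutorial  <- function(...) scale_fill_manual(values = clrs, ...)
```

Any plot can then end with + scale_color_tutorial(name = "Genre") instead of repeating the color vector.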


Part 1: Exploring Relationships

In this section, we’ll learn to visualize relationships between variables. We’ll start simple and gradually add complexity.

Scatter Plots: The Foundation

When to use scatter plots: To show the relationship between two continuous (numeric) variables.

Research questions answered:
- Is there a relationship between X and Y?
- Does the relationship vary by group?
- Are there outliers or unusual patterns?

Building Your First Scatter Plot

Let’s create a basic scatter plot step by step:

Code
# Step 1: Most basic scatter plot
ggplot(data = pdat,                    # Our dataset
       aes(x = Date,                   # X-axis variable
           y = Prepositions)) +        # Y-axis variable
  geom_point()                         # Add points

Understanding the Code
  • ggplot(): Initialize the plot
  • aes(): Define “aesthetics” (what goes where)
  • geom_point(): Add a layer of points
  • +: Add layers together (like building blocks!)

Exercise 1.1: Your First Plot

Try It Yourself!

Create a scatter plot showing the relationship between Date (x-axis) and Prepositions (y-axis) using the code above.

Questions to consider:
1. What pattern do you see?
2. Are prepositions becoming more or less frequent over time?
3. Is the relationship linear or does it curve?

Adding Color: Visualizing Groups

Now let’s add color to distinguish between genres:

Code
ggplot(pdat,
       aes(x = Date,
           y = Prepositions,
           color = GenreRedux)) +        # Color by genre
  geom_point() +
  theme_bw()                             # Clean black & white theme

What changed?
- color = GenreRedux inside aes() colors points by genre
- theme_bw() gives us a cleaner, professional look
- ggplot2 automatically creates a legend!

Customizing Colors and Shapes

Let’s make our plot publication-ready:

Code
ggplot(pdat, 
       aes(Date, Prepositions, 
           color = GenreRedux, 
           shape = GenreRedux)) +          # Different shapes for genres
  geom_point(size = 2) +                   # Larger points
  scale_shape_manual(
    name = "Genre",
    values = 1:5                           # Different point shapes
  ) +
  scale_color_manual(
    name = "Genre",
    values = clrs                          # Our custom colors
  ) +
  theme_bw() +
  theme(legend.position = "top")           # Move legend to top

Design Principle: Redundant Encoding

Using both color AND shape to show genre makes your plot more accessible:
- People with color blindness can use shapes
- Black & white printing preserves information
- Easier to distinguish groups when many overlap

Exercise 1.2: Customize Your Plot

Challenge

Modify the plot above to:
1. Change the theme to theme_minimal() or theme_classic()
2. Move the legend to the bottom
3. Try different point sizes (hint: change the size parameter)

Bonus: Try theme_void() - what happens? Why might this be useful (or not)?

Adding Statistical Layers

Trend Lines: Seeing Patterns

Let’s add trend lines to see patterns more clearly:

Code
ggplot(pdat, aes(Date, Prepositions, color = Genre)) +
  facet_wrap(vars(Genre), ncol = 4) +    # Separate panel per genre
  geom_point(alpha = 0.5) +              # Semi-transparent points
  geom_smooth(method = "lm", se = FALSE) + # Linear trend line
  theme_bw() +
  theme(
    legend.position = "none",             # No legend needed (titles show genre)
    axis.text.x = element_text(size = 8, angle = 90)
  )

New concepts:
- facet_wrap(): Create separate panels for each group
- alpha = 0.5: Make points semi-transparent (50% opacity)
- geom_smooth(): Add a smoothed trend line
- method = "lm": Use linear regression
- se = FALSE: Don’t show confidence interval

When to Use Facets

Facets (separate panels) work best when:
- You have 3-8 groups to compare
- Patterns within groups are important
- Overlapping points make one plot hard to read

Avoid facets when:
- You need to directly compare values across groups
- You have too many groups (>10)

Density Overlays: Alternative to Points

Sometimes you have too many overlapping points. Here’s an alternative:

Code
ggplot(pdat, aes(x = Date, y = Prepositions, color = GenreRedux)) +
  facet_wrap(vars(GenreRedux), ncol = 5) +
  geom_density_2d() +                    # 2D density contours
  theme_bw() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(size = 8, angle = 90)
  )

What are density contours? Think of them like topographic map lines - they show where data points are concentrated.

Quick Comparison Table

Visualization      Best For                                 Limitations
Points             Small-medium datasets, seeing all data   Gets messy with many points
Trend lines        Showing overall patterns                 Hides individual variation
Density contours   Large datasets, concentration patterns   Harder to interpret
Hex bins (next!)   Very large datasets                      Requires uniform X-Y scales

Hex Plots: Handling Big Data

When you have thousands of points, hex plots show density efficiently:

Code
pdat |>
  ggplot(aes(x = Date, y = Prepositions)) +
  geom_hex() +                          # Hexagonal binning
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  theme_bw()

Darker hexagons = more data points in that region.

Exercise 1.4: Comparing Approaches

Synthesis Challenge

Create three plots of the same data:
1. A scatter plot with geom_point()
2. A density plot with geom_density_2d()
3. A hex plot with geom_hex()

Reflect:
- What different insights does each provide?
- Which would you use in a paper? A presentation? An exploratory analysis?


Part 2: Showing Distributions

Understanding distributions helps us see patterns, outliers, and the “shape” of our data.

Density Plots: Smooth Distribution Curves

When to use: To show how values are distributed, especially comparing groups.

Code
ggplot(pdat, aes(Date, fill = Region)) +
  geom_density(alpha = 0.5) +           # Semi-transparent densities
  scale_fill_manual(values = clrs[1:2]) +
  theme_bw() +
  theme(legend.position = c(0.1, 0.9))  # Position inside plot area

Reading density plots:
- X-axis: Values of the variable (Date)
- Y-axis: Density (higher = more data points)
- Peaks: Most common values
- Width: Spread of the data

Interpreting This Plot

The plot shows that:
- Southern texts continue into the 1800s
- Northern texts end around 1700
- There’s an overlap period where both regions produced texts

Exercise 2.1: Distribution Detective

Investigation

Create a density plot of Prepositions (not Date), colored by GenreRedux.

Questions:
1. Which genre has the highest average preposition frequency?
2. Which genre shows the most variation (widest distribution)?
3. Do any genres have unusual distributions (multiple peaks, asymmetry)?

Histograms: Counting in Bins

Histograms are similar to density plots but show actual counts:

Code
ggplot(pdat, aes(Prepositions)) +
  geom_histogram(bins = 30,              # Number of bins
                 fill = "steelblue",
                 color = "white") +      # Outline color
  theme_bw() +
  labs(title = "Distribution of Preposition Frequencies",
       x = "Prepositions per 1,000 words",
       y = "Count")

Comparing Groups with Histograms

Code
ggplot(pdat, aes(Prepositions, fill = Region)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  scale_fill_manual(values = clrs[1:2]) +
  theme_bw() +
  theme(legend.position = "top")

Histogram vs. Bar Plot

Don’t confuse these!
- Histogram: Shows distribution of ONE continuous variable (bins are ranges)
- Bar plot: Shows counts/values for CATEGORIES (bars are discrete groups)
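The difference in a minimal sketch (a toy data frame stands in for pdat; the plots are built but not printed):

Code

```r
library(ggplot2)

# Toy stand-in for pdat so the sketch runs on its own
toy <- data.frame(
  Prepositions = c(120, 131, 139, 145, 151, 157, 160, 166),
  Genre        = c("Fiction", "Fiction", "NonFiction", "Religious",
                   "NonFiction", "Legal", "Legal", "NonFiction")
)

# Histogram: bins ONE continuous variable into value ranges
p_hist <- ggplot(toy, aes(Prepositions)) + geom_histogram(bins = 4)

# Bar plot: counts occurrences of each discrete category
p_bar <- ggplot(toy, aes(Genre)) + geom_bar()
```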

Exercise 2.2: Finding the Right Bin Width

Experiment

Create three histograms of Prepositions with different numbers of bins:
1. bins = 10
2. bins = 30
3. bins = 100

Discuss:
- Too few bins: What information is lost?
- Too many bins: What problems arise?
- How do you choose the “right” number?

Hint: ggplot2 defaults to bins = 30, which is often a reasonable starting point. For a data-driven choice, the Freedman-Diaconis rule sets the bin width to 2 × IQR / n^(1/3).
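Base R implements the Freedman-Diaconis rule as grDevices::nclass.FD(); here is a sketch on simulated data standing in for pdat$Prepositions:

Code

```r
set.seed(42)
x <- rnorm(537, mean = 140, sd = 15)   # stand-in for pdat$Prepositions

# Freedman-Diaconis: bin width = 2 * IQR / n^(1/3)
n_bins <- grDevices::nclass.FD(x)
n_bins   # pass this to geom_histogram(bins = n_bins)
```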

Ridge Plots: Beautiful Distribution Comparisons

Ridge plots elegantly show multiple distributions:

Code
library(ggridges)

pdat |>
  ggplot(aes(x = Prepositions, y = GenreRedux, fill = GenreRedux)) +
  geom_density_ridges() +
  theme_ridges() +
  theme(legend.position = "none") +
  labs(y = "", 
       x = "Relative frequency of prepositions")

Why ridge plots are great:
- Easy to compare shapes across many groups
- Aesthetically pleasing
- Popular in modern data visualization

Exercise 2.3: Ridge Plot Exploration

Create and Customize
  1. Create a ridge plot of Prepositions by DateRedux (instead of GenreRedux)
  2. Add color with scale_fill_manual(values = clrs)
  3. Try geom_density_ridges(alpha = 0.6, stat = "binline", bins = 20) - what changes?

Bonus: Research what stat = "binline" does. Why might you choose this over smooth densities?

Boxplots: The Statistical Summary

Boxplots show five key statistics at once:

Code
ggplot(pdat, aes(DateRedux, Prepositions, fill = DateRedux)) +
  geom_boxplot() +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Time Period", 
       y = "Prepositions (per 1,000 words)")

Reading a Boxplot

[Figure: anatomy of a boxplot, showing median, quartiles, whiskers, and outliers]

  • Line in box: Median (50th percentile)
  • Box: Interquartile range (IQR) - middle 50% of data
  • Whiskers: Extend to the most extreme data points within 1.5 × IQR of the box
  • Dots: Outliers beyond whiskers
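These statistics can be checked directly in base R. A sketch on simulated data: stats::fivenum() returns Tukey's five numbers, which is essentially what geom_boxplot() draws:

Code

```r
set.seed(1)
x <- c(rnorm(100, mean = 140, sd = 12), 200)   # one artificial outlier

fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum

# Whisker reach under the 1.5 * IQR convention
iqr   <- IQR(x)
lower <- quantile(x, 0.25) - 1.5 * iqr
upper <- quantile(x, 0.75) + 1.5 * iqr
x[x < lower | x > upper]   # values drawn as outlier dots
```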

Notched Boxplots: Testing Differences

Code
ggplot(pdat, aes(DateRedux, Prepositions, fill = DateRedux)) +
  geom_boxplot(notch = TRUE,             # Add notches
               outlier.colour = "red",
               outlier.shape = 2,
               outlier.size = 3) +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  theme(legend.position = "none")

The Notch Test

If the notches of two boxes don’t overlap → strong evidence that the group medians differ.

This is a visual “rough test” - not a replacement for proper statistics!
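There is a simple formula behind the notch: ggplot2 draws it at roughly median ± 1.58 × IQR / √n. A base-R sketch on simulated data:

Code

```r
set.seed(7)
x <- rnorm(80, mean = 140, sd = 12)   # simulated preposition frequencies

m    <- median(x)
half <- 1.58 * IQR(x) / sqrt(length(x))   # half-width of the notch
c(lower = m - half, upper = m + half)     # approximate notch limits
```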

Enhanced Boxplots with Individual Points

Code
library(EnvStats)

ggplot(pdat, aes(DateRedux, Prepositions, fill = DateRedux, color = DateRedux)) +
  geom_boxplot(varwidth = TRUE,          # Width proportional to sample size
               color = "black", 
               alpha = 0.3) +
  geom_jitter(alpha = 0.3,               # Add individual points
              height = 0,                 # Don't jitter vertically
              width = 0.2) +              # Small horizontal spread
  facet_grid(~Region) +
  EnvStats::stat_n_text(y.pos = 65) +    # Add sample sizes
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "", 
       y = "Frequency (per 1,000 words)",
       title = "Preposition Use Across Time and Regions")

Exercise 2.4: Boxplot Mastery

Advanced Challenge
  1. Create a boxplot of Prepositions by GenreRedux
  2. Add notches
  3. Add jittered points
  4. Color by genre
  5. Add appropriate labels

Analysis questions:
- Which genres show the most variation?
- Are there any outliers? What might they represent?
- Do any genre pairs show non-overlapping notches?

Violin Plots: Best of Both Worlds

Violin plots combine boxplot statistics with density shapes:

Code
ggplot(pdat, aes(DateRedux, Prepositions, fill = DateRedux)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  theme(legend.position = "none")

Violin plots show:
- Distribution shape (like density plots)
- Median and quartiles (like boxplots)
- Multimodal distributions (multiple peaks)

When to Choose Each Plot Type

Plot Type   Best For                                   Avoid When
Histogram   Single variable, showing counts            Comparing many groups
Density     Smooth distributions, comparisons          Need exact counts
Ridge       Many groups, emphasis on shapes            Fewer than 3 groups
Boxplot     Statistical summary, outliers              Distribution shape matters
Violin      Shape + summary, detecting multimodality   Small sample sizes

Exercise 2.5: Distribution Showdown

Comparative Analysis

For the variable Prepositions grouped by GenreRedux, create:
1. A ridge plot
2. A boxplot
3. A violin plot

Reflection:
- What does each reveal that the others don’t?
- If you could only show ONE plot in a paper, which would you choose and why?
- How does sample size affect each plot type?


Part 3: Categorical Data

Working with categorical variables requires different approaches. Let’s explore the options!

Bar Plots: The Workhorse of Categories

First, let’s create summary data:

Code
bdat <- pdat |>
  dplyr::mutate(DateRedux = factor(DateRedux)) |>
  group_by(DateRedux) |>
  dplyr::summarise(Frequency = n()) |>
  dplyr::mutate(Percent = round(Frequency / sum(Frequency) * 100, 1))

# View the data
bdat
# A tibble: 5 × 3
  DateRedux Frequency Percent
  <fct>         <int>   <dbl>
1 1150-1499        34     6.3
2 1500-1599       180    33.5
3 1600-1699       225    41.9
4 1700-1799        53     9.9
5 1800-1913        45     8.4

Basic Bar Plot

Code
ggplot(bdat, aes(DateRedux, Percent, fill = DateRedux)) +
  geom_bar(stat = "identity") +          # Use actual values
  geom_text(aes(y = Percent - 3,         # Position labels
                label = paste0(Percent, "%")), 
            color = "white", 
            size = 4) +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Time Period",
       y = "Percentage of Documents",
       title = "Distribution of Texts Across Time Periods")

stat = "identity" Explained
  • geom_bar() by default counts occurrences (stat = "count")
  • Use stat = "identity" when your data already contains the values to plot
  • Think: “plot the values AS IS (their identity)”
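Note that geom_col() is ggplot2's built-in shorthand for geom_bar(stat = "identity"); the two plots below are identical:

Code

```r
library(ggplot2)

counts <- data.frame(Period  = c("A", "B", "C"),
                     Percent = c(20, 50, 30))

p1 <- ggplot(counts, aes(Period, Percent)) + geom_bar(stat = "identity")
p2 <- ggplot(counts, aes(Period, Percent)) + geom_col()   # same result
```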

Grouped Bar Plots

Code
ggplot(pdat, aes(Region, fill = DateRedux)) +
  geom_bar(position = position_dodge(),  # Side-by-side bars
           stat = "count") +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  labs(x = "Region",
       y = "Number of Documents",
       fill = "Time Period")

When to use grouped bars:
- Comparing sub-categories within main categories
- 2-3 sub-groups work best
- Direct comparison between groups is important

Stacked Bar Plots

Code
ggplot(pdat, aes(DateRedux, fill = GenreRedux)) +
  geom_bar(stat = "count") +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  labs(x = "Time Period",
       y = "Number of Documents",
       fill = "Genre",
       title = "Genre Composition Across Time Periods")

Normalized Stacked Bars (100%)

Code
ggplot(pdat, aes(DateRedux, fill = GenreRedux)) +
  geom_bar(stat = "count", position = "fill") +
  scale_fill_manual(values = clrs) +
  scale_y_continuous(labels = scales::percent) +  # Format as percentages
  theme_bw() +
  labs(x = "Time Period",
       y = "Proportion of Documents",
       fill = "Genre",
       title = "Relative Genre Composition Over Time")

Choosing Bar Plot Types

Grouped bars when:
- Comparing specific values across groups
- You have 2-3 subgroups
- Actual counts matter

Stacked bars when:
- Showing composition (parts of a whole)
- Total amount is important
- You have 3-6 subgroups

100% stacked when:
- Only proportions matter (not absolute values)
- Emphasizing compositional changes

Exercise 3.1: Bar Plot Practice

Build Your Skills
  1. Create a grouped bar plot showing GenreRedux by Region
  2. Create a stacked bar plot of the same data
  3. Create a 100% stacked version

Questions:
- Which plot makes it easiest to compare genre frequencies between regions?
- Which shows total document counts best?
- What story does the 100% stacked version tell?

Likert Scale Visualizations

Survey data with Likert scales (Strongly Disagree → Strongly Agree) needs special treatment.

First, let’s load some survey data:

Code
ldat <- base::readRDS("tutorials/dviz/data/lid.rda")
head(ldat)
   Course Satisfaction
1 Chinese            1
2 Chinese            1
3 Chinese            1
4 Chinese            1
5 Chinese            1
6 Chinese            1

Method 1: Grouped Bar Plot

Code
# Summarize the data
nlik <- ldat |>
  dplyr::group_by(Course, Satisfaction) |>
  dplyr::summarize(Frequency = n())

# Create grouped bar plot
ggplot(nlik, aes(Satisfaction, Frequency, fill = Course)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_fill_manual(values = clrs[1:3]) +
  geom_text(aes(label = Frequency),
            vjust = 1.6, color = "white",
            position = position_dodge(0.9), size = 3.5) +
  scale_x_continuous(
    breaks = 1:5,
    labels = c("Very\nDissatisfied", "Dissatisfied", 
               "Neutral", "Satisfied", "Very\nSatisfied")
  ) +
  theme_bw() +
  labs(title = "Student Satisfaction by Course",
       x = "Satisfaction Level",
       y = "Number of Students")

Method 2: Cumulative Line Graph

Code
ggplot(ldat, aes(x = Satisfaction, color = Course)) +
  stat_ecdf(geom = "step", linewidth = 1.5) +
  scale_colour_manual(values = clrs[1:3]) +
  scale_x_continuous(
    breaks = 1:5,
    labels = c("Very\nDissatisfied", "Dissatisfied", 
               "Neutral", "Satisfied", "Very\nSatisfied")
  ) +
  theme_bw() +
  labs(title = "Cumulative Satisfaction Distribution",
       y = "Cumulative Proportion",
       x = "Satisfaction Level")

Reading Cumulative Plots
  • Steeper lines = responses concentrated in that range
  • Higher line at left = more dissatisfied responses
  • Lines that cross = different distribution patterns
  • Gap between lines = difference in satisfaction

Method 3: gglikert (Modern Approach)

Code
# Load survey data with multiple questions
sdat <- base::readRDS("tutorials/dviz/data/sdd.rda")

# Clean column names
colnames(sdat)[3:ncol(sdat)] <- paste0(
  "Q", str_pad(1:10, 2, "left", "0"), ": ",
  colnames(sdat)[3:ncol(sdat)]
) |>
  stringr::str_replace_all("\\.", " ") |>
  stringr::str_squish() |>
  stringr::str_replace_all("$", "?")

# Convert to factors with labels
lbs <- c("Disagree", "Somewhat\nDisagree", "Neutral", 
         "Somewhat\nAgree", "Agree")

survey <- sdat |>
  dplyr::mutate_if(is.character, factor) |>
  dplyr::mutate_if(is.numeric, factor, levels = 1:5, labels = lbs) |>
  drop_na() |>
  as.data.frame()

# Create gglikert plot
survey |>
  dplyr::select(matches("01|02|03|04")) |>
  gglikert(labels_size = 2.5,
           add_labels = FALSE) +
  ggtitle("Survey Responses to Selected Questions") +
  scale_fill_brewer(palette = "RdBu")

Likert Best Practices
  1. Order matters: Keep response scales in order (don’t sort by frequency)
  2. Neutral center: Place neutral/midpoint in the middle
  3. Diverging colors: Use colors that diverge from center (e.g., Red-Blue)
  4. Group facets: Use for comparing sub-groups
  5. Consider n: Show sample sizes when comparing groups

Exercise 3.2: Survey Visualization Challenge

Real-World Application

Imagine you’ve surveyed 100 students about their experience in an online course. Create visualizations to show:

  1. Overall satisfaction distribution (use ldat as an example)
  2. Comparison between different courses
  3. Which visualization would you use in:
    • An academic paper?
    • A presentation to administrators?
    • A quick report to instructors?

Reflect: How does your choice of visualization affect the “story” the data tells?

Pie Charts: Use With Caution

Design Warning

Pie charts are popular but problematic:
- Hard to compare slice sizes
- Difficult to estimate percentages
- Problematic with many categories
- Bar plots almost always work better

When pies might be okay:
- Very few categories (2-3)
- One category is dominant (~50%+)
- Showing parts of a whole is crucial

Here’s how to make one anyway (for comparison):

Code
# Create data for pie chart
piedata <- bdat |>
  dplyr::arrange(desc(DateRedux)) |>
  dplyr::mutate(Position = cumsum(Percent) - 0.5 * Percent)

# Create side-by-side comparison
p1 <- ggplot(bdat, aes("", Percent, fill = DateRedux)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 0.7) +
  scale_fill_manual(values = clrs) +
  theme_minimal() +
  labs(title = "Bar Plot", y = "Percent")

p2 <- ggplot(piedata, aes("", Percent, fill = DateRedux)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  scale_fill_manual(values = clrs) +
  theme_void() +
  geom_text(aes(y = Position, label = paste0(Percent, "%")), 
            color = "white", size = 4) +
  labs(title = "Pie Chart")

gridExtra::grid.arrange(p1, p2, nrow = 1)

Which is easier to interpret? Why?

Exercise 3.3: Pie vs. Bar Debate

Critical Thinking

Look at the comparison above.

  1. Without looking at the numbers, which time period has the highest percentage in the pie chart?
  2. Try the same question with the bar plot.
  3. Which differences are easier to see?

Challenge: Find a situation where a pie chart might actually be the better choice. Share your reasoning!


Part 4: Advanced Visualizations

Now that you’ve mastered the basics, let’s explore some specialized and advanced plot types.

Heatmaps: Visualizing Matrices

Heatmaps use color to represent values in a matrix or table.

Code
# Create and scale data
heatdata <- pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Prepositions = mean(Prepositions)) |>
  tidyr::spread(DateRedux, Prepositions)

heatmx <- as.matrix(heatdata[, 2:6])   # all five time-period columns
rownames(heatmx) <- heatdata$GenreRedux
heatmx <- scale(heatmx)  # Standardize
Code
heatmap(heatmx, 
        scale = "none",           # Already scaled
        col = colorRampPalette(c("blue", "white", "red"))(50),
        margins = c(7, 10))       # Adjust label margins

Reading heatmaps:
- Color intensity: Magnitude of value
- Dendrograms (tree diagrams): Show clustering/similarity
- Rows/columns: Can be reordered to reveal patterns

When to Use Heatmaps
  • Showing patterns in large matrices
  • Gene expression data
  • Correlation matrices
  • Time-series across categories
  • Survey responses across questions

Avoid when:
- Data is sparse (many missing values)
- Categories don’t have natural ordering
- Precise values matter more than patterns
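As a quick illustration of the correlation-matrix use case, here is a minimal sketch using the built-in mtcars data:

Code

```r
# Correlation matrix of a few mtcars variables
cors <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

heatmap(cors,
        scale = "none",   # correlations are already on a common scale
        col   = colorRampPalette(c("blue", "white", "red"))(50),
        symm  = TRUE)
```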

Association Plots: Expected vs. Observed

Association plots show deviations from expected frequencies:

Code
library(vcd)

# Prepare data
assocdata <- pdat |>
  dplyr::mutate(
    GenreRedux = dplyr::case_when(
      GenreRedux == "Conversational" ~ "Conv.",
      GenreRedux == "Religious" ~ "Relig.",
      TRUE ~ GenreRedux
    )
  ) |>
  dplyr::group_by(GenreRedux, DateRedux) |>
  dplyr::summarise(Prepositions = round(mean(Prepositions), 0)) |>
  tidyr::spread(DateRedux, Prepositions)

assocmx <- as.matrix(assocdata[, 2:6])
rownames(assocmx) <- assocdata$GenreRedux
Code
assoc(assocmx, shade = TRUE,
      main = "Association Plot: Genre × Time Period")

Interpreting association plots:
- Above the line: More than expected
- Below the line: Less than expected
- Blue shading: Significantly more than expected
- Red shading: Significantly less than expected
- Bar width: Contribution to chi-square statistic

Mosaic Plots: Proportional Rectangles

Code
mosaic(assocmx, shade = TRUE, legend = TRUE,
       main = "Mosaic Plot: Genre Composition Over Time")

Reading mosaic plots:
- Rectangle size: Proportion of total
- Color: Deviation from expected (like association plots)
- Position: Shows conditional relationships

Mosaic vs. Association Plots

Mosaic plots:
- Show proportions visually through rectangle size
- Better for understanding composition
- Good for presentations

Association plots:
- Emphasize statistical significance
- Better for identifying specific deviations
- Good for detailed analysis

Word Clouds: Visualizing Text

Word clouds show word frequencies. Let’s analyze political speeches:

Code
library(quanteda)
library(quanteda.textplots)

# Load speeches
clinton <- base::readRDS("tutorials/dviz/data/Clinton.rda") |> 
  paste0(collapse = " ")
trump <- base::readRDS("tutorials/dviz/data/Trump.rda") |> 
  paste0(collapse = " ")

# Create corpus
corp_dom <- quanteda::corpus(c(clinton, trump))
attr(corp_dom, "docvars")$Author <- c("Clinton", "Trump")

# Process text
corp_dom <- corp_dom |>
  quanteda::tokens(remove_punct = TRUE) |>
  quanteda::tokens_remove(stopwords("english")) |>
  quanteda::dfm() |>
  quanteda::dfm_group(groups = corp_dom$Author) |>
  quanteda::dfm_trim(min_termfreq = 200, verbose = FALSE)

Simple Word Cloud

Code
corp_dom |>
  quanteda.textplots::textplot_wordcloud(comparison = FALSE,
                                         max_words = 50)

Comparison Cloud

Code
corp_dom |>
  quanteda.textplots::textplot_wordcloud(
    comparison = TRUE,
    max_words = 50,
    color = c("blue", "red")
  )

Word Cloud Limitations

Problems:
- Word sizes are hard to compare precisely
- Common words dominate even after removing stop words
- No context (meaning can be misleading)
- Can misrepresent emphasis

Better for:
- Initial exploration
- Public presentations (engaging but not precise)
- Showing overall themes
- Complementing (not replacing) quantitative analysis

Exercise 4.1: Text Analysis

Interpretation Challenge

Looking at the comparison cloud above:

  1. What themes differentiate Clinton from Trump?
  2. What do the largest words in each color suggest about their campaign focus?
  3. What are the limitations of this visualization?
  4. What additional analyses would you want to do?

Bonus: Research “topic modeling” - how might this provide deeper insights than word clouds?

Flags in Visualizations

Adding country flags can make international comparisons more engaging:

Code
flagsdf <- data.frame(
  Region = c("Australia", "Canada", "Great Britain", "India", 
             "Ireland", "New Zealand", "United States"),
  Percent = c(0.022, 0.017, 0.025, 0.010, 0.019, 0.020, 0.036),
  Kachru = c("Inner circle", "Inner circle", "Inner circle", "Outer circle",
             "Inner circle", "Inner circle", "Inner circle"),
  country = c("au", "ca", "gb", "in", "ie", "nz", "us")
)
Code
flagsdf |>
  ggplot(aes(x = reorder(Region, Percent), 
             y = Percent, 
             country = country,
             fill = Kachru)) +
  geom_bar(stat = "identity") +
  ggflags::geom_flag(size = 5) +
  geom_text(aes(label = scales::percent(Percent, accuracy = 0.1)),
            hjust = -0.3, size = 3) +
  coord_flip(ylim = c(0, 0.045)) +
  scale_fill_manual(values = c("lightblue", "coral")) +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  labs(x = "", 
       y = "Vulgar Language Percentage",
       title = "Vulgar Language Use by English-Speaking Region",
       fill = "English Type") +
  theme(legend.position = c(0.8, 0.3),
        panel.grid.major = element_blank())

When to Use Flags

Good for:
- International comparisons
- Making data more accessible to general audiences
- Adding visual interest to country-level data

Requirements:
- Need ISO country codes (e.g., “us”, “gb”, “au”)
- Works best with horizontal bar plots
- Don’t overuse - can look unprofessional in some contexts


Part 5: Time Series and Lines

Time series data shows how things change over time. Line graphs are the go-to visualization.

Basic Line Graphs

Code
pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Frequency = mean(Prepositions)) |>
  ggplot(aes(x = DateRedux, y = Frequency, 
             group = GenreRedux, 
             color = GenreRedux)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +               # Add points at data locations
  scale_color_manual(values = clrs) +
  theme_minimal() +
  labs(title = "Preposition Frequency Over Time by Genre",
       x = "Time Period",
       y = "Mean Frequency (per 1,000 words)",
       color = "Genre")

Line Graph Essentials
  • Points: Show actual data locations
  • Lines: Show trends/connections
  • Group aesthetic: Tells ggplot which points to connect
  • Color: Distinguishes different series

Smoothed Line Graphs

For continuous time variables, smoothing reveals trends:

Code
ggplot(pdat, aes(x = Date, y = Prepositions, 
                 color = GenreRedux, 
                 linetype = GenreRedux)) +
  geom_smooth(se = FALSE, linewidth = 1.2) +
  scale_linetype_manual(
    values = c("solid", "dashed", "dotted", "dotdash", "longdash"),
    name = "Genre"
  ) +
  scale_colour_manual(values = clrs, name = "Genre") +
  theme_bw() +
  theme(legend.position = "top") +
  labs(x = "Year", 
       y = "Relative Frequency\n(per 1,000 words)",
       title = "Smoothed Trends in Preposition Use")

Why smooth?
- Reduces noise from individual data points
- Shows overall trends more clearly
- Uses LOESS (locally weighted smoothing) by default for smaller samples (under ~1,000 points; a GAM otherwise)
- Helpful when you have many data points

Ribbon Plots: Showing Uncertainty

Ribbon plots display ranges (like min/max or confidence intervals):

Code
pdat |>
  dplyr::mutate(DateRedux = as.numeric(DateRedux)) |>
  dplyr::group_by(DateRedux) |>
  dplyr::summarise(
    Mean = mean(Prepositions),
    Min = min(Prepositions),
    Max = max(Prepositions),
    SD = sd(Prepositions)
  ) |>
  ggplot(aes(x = DateRedux, y = Mean)) +
  geom_ribbon(aes(ymin = Min,           # Min-max ribbon (wider; drawn first)
                  ymax = Max), 
              fill = "gray80", 
              alpha = 0.3) +
  geom_ribbon(aes(ymin = Mean - SD,     # ±1 SD ribbon (drawn on top)
                  ymax = Mean + SD), 
              fill = "lightblue", 
              alpha = 0.4) +
  geom_line(linewidth = 1.2, color = "darkblue") +
  scale_x_continuous(breaks = seq_along(names(table(pdat$DateRedux))),
                     labels = names(table(pdat$DateRedux))) +
  theme_minimal() +
  labs(title = "Preposition Frequency: Mean with Variation",
       x = "Time Period",
       y = "Frequency (per 1,000 words)") +
  ggplot2::annotate("text", x = 2.5, y = 180, 
           label = "Gray = Min-Max range", size = 3) +
  ggplot2::annotate("text", x = 2.5, y = 170, 
           label = "Blue = ±1 SD", size = 3)

Ribbon plots are excellent for:
- Showing uncertainty
- Displaying confidence intervals
- Visualizing ranges in forecasts
- Comparing variability across time


Part 6: Specialized Plots

Let’s explore some specialized plot types for specific scenarios.

Balloon Plots

Balloon plots show three variables: two categorical and one continuous.

Code
pdat |>
  dplyr::mutate(DateRedux = factor(DateRedux)) |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Prepositions = mean(Prepositions)) |>
  ggplot(aes(DateRedux, GenreRedux,
             size = Prepositions,
             fill = GenreRedux)) +
  geom_point(shape = 21, alpha = 0.7) +
  scale_size_area(max_size = 20) +
  scale_fill_manual(values = clrs) +
  theme_minimal() +
  theme(legend.position = "none",
        panel.grid.major = element_line(color = "gray90")) +
  labs(title = "Preposition Frequency: Genre × Time Period",
       x = "Time Period",
       y = "Genre",
       size = "Frequency")

When to use balloon plots:
- Showing three variables simultaneously
- Matrix-style comparisons
- When circle size is intuitive for your audience

Limitations:
- Hard to compare sizes precisely
- Can get crowded with many categories
- Consider a heatmap as an alternative

Dot Plots with Error Bars

Showing means with confidence intervals:

Code
ggplot(pdat, aes(x = reorder(Genre, Prepositions, mean), 
                 y = Prepositions,
                 group = Genre)) +
  stat_summary(fun = mean,               # Plot means
               geom = "point", 
               size = 4,
               aes(color = Genre)) +
  stat_summary(fun.data = mean_cl_boot,  # Bootstrap 95% CI (requires Hmisc)
               geom = "errorbar", 
               width = 0.2,
               linewidth = 1) +
  coord_cartesian(ylim = c(80, 200)) +
  theme_bw(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  ) +
  labs(x = "", 
       y = "Prepositions (per 1,000 words)",
       title = "Mean Preposition Frequency by Genre",
       subtitle = "Error bars show 95% confidence intervals")

Error Bars vs. Boxplots

Error bars show:
- Specific statistic (mean, median)
- Specific uncertainty measure (SE, CI, SD)
- Cleaner look for publications

Boxplots show:
- More distributional information
- Quartiles and outliers
- Better for detecting skewness

Exercise 6.1: Comparison Challenge

Statistical Visualization

Create two plots of Prepositions by GenreRedux:
1. A dot plot with error bars (use code above)
2. A boxplot

Compare:
- What does each tell you?
- Which shows outliers better?
- Which would you use to claim “Genre X has higher frequency than Genre Y”?
- When would you choose each?

Comparative Bar Plots with Negatives

Sometimes you want to show deviation from a reference:

Code
# Create example data
Test1 <- c(11.2, 13.5, 200, 185, 1.3, 3.5)
Test2 <- c(12.2, 14.7, 210, 175, 1.9, 3.0)
Test3 <- c(13.2, 15.1, 177, 173, 2.4, 2.9)

testdata <- data.frame(Test1, Test2, Test3)
rownames(testdata) <- c(
  "Feature1_Student", "Feature1_Reference",
  "Feature2_Student", "Feature2_Reference",
  "Feature3_Student", "Feature3_Reference"
)

# Calculate deviations
FeatureA <- t(testdata[1, ] - testdata[2, ])
FeatureB <- t(testdata[3, ] - testdata[4, ])
FeatureC <- t(testdata[5, ] - testdata[6, ])

plottable <- data.frame(
  Test = rep(rownames(FeatureA), 3),
  Value = c(FeatureA, FeatureB, FeatureC),
  Feature = rep(c("FeatureA", "FeatureB", "FeatureC"), each = 3)
)

# Plot divergence
ggplot(plottable, aes(Test, Value, fill = Test)) +
  facet_grid(vars(Feature), scales = "free_y") +
  geom_col() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  scale_fill_manual(values = clrs[1:3]) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Test",
       y = "Deviation from Reference",
       title = "Learner Performance: Deviation from Native Speakers",
       subtitle = "Positive = Above reference, Negative = Below reference")

Use cases:
- Language learner vs. native speaker comparisons
- Treatment vs. control groups
- Actual vs. expected values
- Change from baseline


Part 7: Publication-Ready Plots

Let’s pull everything together to create publication-quality visualizations.

The Anatomy of a Perfect Plot

A publication-ready plot needs:

  1. Clear title and subtitle
  2. Axis labels with units
  3. Legend (when needed)
  4. Appropriate theme
  5. Readable fonts
  6. Colorblind-friendly palette
  7. Proper sizing
  8. Citation/source (when relevant)

Example: Building a Complete Plot

Code
pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(
    Mean = mean(Prepositions),
    SE = sd(Prepositions) / sqrt(n()),
    N = n()
  ) |>
  ggplot(aes(x = DateRedux, y = Mean, 
             color = GenreRedux, 
             group = GenreRedux)) +
  # Data layers
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = Mean - SE, ymax = Mean + SE),
                width = 0.2, linewidth = 0.8) +
  # Scales
  scale_color_manual(
    name = "Text Genre",
    values = clrs,
    labels = c("Conversational", "Fiction", "Legal", 
               "Non-fiction", "Religious")
  ) +
  scale_y_continuous(
    breaks = seq(100, 200, 20),
    limits = c(100, 200)
  ) +
  # Theme and labels
  theme_bw(base_size = 14) +
  theme(
    legend.position = "inside",
    legend.position.inside = c(0.15, 0.65),
    legend.background = element_rect(fill = "white", color = "black"),
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12, color = "gray30"),
    plot.caption = element_text(size = 10, hjust = 0)
  ) +
  labs(
    title = "Historical Trends in Preposition Usage",
    subtitle = "Analysis of English texts from 1150-1913",
    x = "Time Period",
    y = "Mean Frequency (per 1,000 words)",
    caption = "Source: Penn Parsed Corpora of Historical English (PPC)\nError bars show ±1 SE"
  )

Saving High-Quality Figures

Code
# ggsave() saves the most recently displayed plot unless plot = is given

# Save for publication (vector format; scales without pixelation)
ggsave("preposition_trends.pdf",
       width = 10, height = 6)

# Save for presentations (high-resolution raster)
ggsave("preposition_trends.png",
       width = 10, height = 6, dpi = 300)

# Save for web (smaller file)
ggsave("preposition_trends_web.png",
       width = 10, height = 6, dpi = 150)

File Format Guide

PNG - Best for:
- Web use
- Presentations
- Figures with photos or complex gradients
- When file size matters

PDF - Best for:
- Publications (journals often require vector)
- Posters
- When scaling is needed
- Print materials

TIFF - Best for:
- Some journal requirements
- Archival purposes

DPI (resolution):
- Web: 72-150 dpi
- Presentations: 150 dpi
- Print: 300 dpi
- Posters: 600 dpi

Color Accessibility

Making plots accessible to colorblind readers:

Code
library(viridis)

# Original plot with problematic colors
p1 <- pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Mean = mean(Prepositions)) |>
  ggplot(aes(DateRedux, Mean, fill = GenreRedux)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("red", "green", "blue", "yellow", "purple")) +
  ggtitle("Problematic Colors") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

# Improved with viridis palette
p2 <- pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Mean = mean(Prepositions)) |>
  ggplot(aes(DateRedux, Mean, fill = GenreRedux)) +
  geom_col(position = "dodge") +
  scale_fill_viridis_d() +
  ggtitle("Colorblind-Friendly") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

gridExtra::grid.arrange(p1, p2, nrow = 1)

Colorblind-friendly palettes:
- scale_color_viridis_d() / scale_fill_viridis_d()
- scale_color_brewer() with “Set2”, “Dark2”, or “Paired”
- ColorBrewer palettes (many are colorblind-safe)

Exercise 7.1: Publication Polish

Final Project

Create a publication-ready visualization:

  1. Choose any relationship in the data

  2. Create a complete plot with:

    • Informative title and subtitle
    • Proper axis labels with units
    • A colorblind-friendly palette
    • Appropriate theme
    • Source citation
    • Legend if needed
  3. Save it in three formats (PNG, PDF, web-optimized PNG)

  4. Write a 2-3 sentence caption that could accompany the figure in a paper

Peer review: Exchange with a colleague - is your plot self-explanatory?


Part 8: Choosing the Right Plot

The hardest part of data visualization is choosing which plot to make. Let’s develop a decision framework.

Decision Tree

Start Here: What’s Your Data Structure?

1. One Continuous Variable

Goal: Show distribution

  • Few data points (<50): Dot plot, strip plot
  • Medium-sized data (50-500): Histogram, density plot
  • Large data (500+): Density plot, violin plot
  • Want summary statistics: Boxplot

2. One Continuous + One Categorical

Goal: Compare groups

  • Compare distributions: Boxplot, violin plot, ridge plot
  • Compare means: Dot plot with error bars
  • Show all data: Jittered points, beeswarm plot

3. Two Continuous Variables

Goal: Show relationship

  • Basic relationship: Scatter plot
  • Many points (overlap): Hex plot, 2D density
  • Add trend: Add geom_smooth()
  • Compare groups: Color by group, facet by group

4. Two Categorical Variables

Goal: Show associations

  • Frequencies: Bar plot (grouped or stacked)
  • Proportions: 100% stacked bar, mosaic plot
  • Statistical test: Association plot

5. Time Series

Goal: Show change over time

  • Discrete time points: Line graph with points
  • Continuous time: Smoothed line, ribbon plot
  • Multiple series: Colored lines, small multiples
  • Uncertainty: Ribbon plot, error bars

6. Three+ Variables

Goal: Show multivariate relationships

  • Third variable categorical: Color/shape, facets
  • Third variable continuous: Color gradient, bubble size
  • Many variables: Heatmap, parallel coordinates
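
As a minimal sketch of the third-variable options, here a continuous third variable (city mileage, cty) is mapped to a colour gradient using ggplot2's built-in mpg data:

```r
library(ggplot2)

# Third continuous variable (cty) mapped to a colour gradient;
# a fourth could be mapped to bubble size via size =
ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_viridis_c(name = "City mpg") +
  theme_minimal() +
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon")
```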

Common Scenarios and Solutions

Scenario 1: Survey Results

Data: Likert scale responses from 5 groups

Options:
1. gglikert plot (best for multiple questions)
2. Stacked bar chart (100% for proportions)
3. Faceted bar charts (best for comparing specific responses)

Choose based on:
- Number of questions (many → gglikert)
- Focus on specific categories (faceted bars)
- Showing overall sentiment (stacked bars)

Scenario 2: Experimental Results

Data: Measurements from treatment and control groups

Options:
1. Boxplots (show distributions + outliers)
2. Violin plots (show distribution shape)
3. Bar plot with error bars (show means + uncertainty)

Choose based on:
- Sample size (small → dot plot, large → violin)
- Presence of outliers (boxplot shows these)
- Simplicity needed (bar + error = simplest)

Scenario 3: Geographic Data

Data: Values across countries/regions

Options:
1. Map (when geography matters)
2. Bar plot with flags (when ranking matters)
3. Dot plot (when precision matters)

Choose based on:
- Audience familiarity with geography
- Whether spatial patterns matter
- Number of regions (maps get cluttered when there are many)

Exercise 8.1: Plot Selection Challenge

Real-World Scenarios

For each scenario, choose the best plot type and explain why:

Scenario A: You have test scores (0-100) for students in 4 different teaching methods. You want to know if methods differ significantly.

Scenario B: You’ve measured reaction times (milliseconds) in 20 trials for each of 50 participants.

Scenario C: You surveyed 200 people about their agreement (5-point scale) with 10 statements about climate change.

Scenario D: You have daily temperature readings for 5 cities over one year.

For each:
1. What plot type would you use?
2. What alternatives did you consider?
3. What would make you change your choice?

Common Mistakes to Avoid

❌ Mistake 1: 3D Charts

Problem: Hard to read, distort data

Code
# DON'T DO THIS
# 3D plots are almost never appropriate for data visualization

Instead: Use 2D charts with proper grouping/faceting

❌ Mistake 2: Dual Y-Axes

Problem: Can be misleading, hard to interpret

Instead:
- Facet plots (separate panels)
- Normalize to same scale
- Use secondary metric only if essential
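
A sketch of the first two alternatives, using made-up temperature and rainfall values: standardize each series and facet, so every panel keeps a single, honest axis:

```r
library(ggplot2)
library(tidyr)
library(dplyr)

# Hypothetical data: two series on very different scales
clim <- data.frame(
  Year = 2000:2009,
  Temperature = c(14.1, 14.3, 14.2, 14.5, 14.4, 14.6, 14.5, 14.7, 14.8, 14.9),
  Rainfall = c(820, 790, 805, 760, 780, 740, 755, 730, 720, 700)
)

clim |>
  pivot_longer(c(Temperature, Rainfall),
               names_to = "Series", values_to = "Value") |>
  group_by(Series) |>
  mutate(Scaled = as.numeric(scale(Value))) |>  # z-scores per series
  ggplot(aes(Year, Scaled)) +
  geom_line(linewidth = 1) +
  facet_wrap(vars(Series), ncol = 1) +          # one panel per series
  theme_minimal() +
  labs(y = "Standardized value (z-score)")
```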

❌ Mistake 3: Too Many Colors

Problem: Confusing, hard to distinguish

Instead:
- Limit to 5-7 colors
- Use ColorBrewer palettes
- Consider faceting instead

❌ Mistake 4: Truncated Y-Axis (Bar Plots)

Problem: Exaggerates differences

Rule: Bar plots should always start at zero

Exception: Dot plots with error bars can use truncated axes
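
A quick sketch with made-up group means: geom_col() keeps the zero baseline by default, while a dot plot can zoom with coord_cartesian(), which crops the view without dropping data (unlike setting limits in scale_y_continuous(), which silently discards points outside the range):

```r
library(ggplot2)

means <- data.frame(Group = c("A", "B", "C"),
                    Mean = c(148, 152, 155))  # hypothetical values

# Bar plot: the y-axis starts at zero (the default) - keep it that way
ggplot(means, aes(Group, Mean)) +
  geom_col() +
  theme_minimal()

# Dot plot: zooming in is acceptable here
ggplot(means, aes(Group, Mean)) +
  geom_point(size = 4) +
  coord_cartesian(ylim = c(140, 160)) +
  theme_minimal()
```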

❌ Mistake 5: Chartjunk

Problem: Decoration distracts from data

Avoid:
- Unnecessary grid lines
- Decorative backgrounds
- 3D effects
- Shadows and gradients (usually)

Instead: Use theme_minimal() or theme_bw() as starting points

The Grammar of Graphics Framework

ggplot2 is based on “The Grammar of Graphics” - understanding this helps you think about plots systematically.

Every plot has:

  1. Data - What you’re visualizing
  2. Aesthetics (aes) - What goes where (x, y, color, size, etc.)
  3. Geometries (geom) - How to display it (points, lines, bars, etc.)
  4. Scales - How aesthetics map to visual properties
  5. Facets - Subplots
  6. Themes - Non-data visual elements

Building blocks:

Code
ggplot(data = <DATA>) +
  aes(x = <X>, y = <Y>, color = <GROUP>) +  # Aesthetics
  geom_<TYPE>() +                            # Geometry
  scale_<AESTHETIC>_<TYPE>() +               # Scales
  facet_<TYPE>(vars(<VARIABLE>)) +           # Facets
  theme_<STYLE>() +                          # Theme
  labs(title = <TITLE>, ...)                 # Labels

This modular approach lets you build any plot by combining these components!
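
For instance, the template can be filled in with ggplot2's built-in mpg data:

```r
library(ggplot2)

ggplot(data = mpg) +
  aes(x = displ, y = hwy, color = drv) +    # Aesthetics
  geom_point() +                            # Geometry
  scale_color_brewer(palette = "Dark2") +   # Scale
  facet_wrap(vars(class)) +                 # Facets
  theme_bw() +                              # Theme
  labs(title = "Highway Mileage by Engine Displacement",
       x = "Displacement (litres)",
       y = "Highway mpg",
       color = "Drive")                     # Labels
```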


Final Challenge: Capstone Project

Comprehensive Data Visualization Project

You’ve learned all the essential techniques. Now put them together!

Your Task

Create a complete data story using the pdat dataset (or your own data). Your project should include:

Required Components:

  1. At least 3 different plot types from different sections:
    • One showing distributions
    • One showing relationships
    • One showing categorical comparisons
  2. Publication-ready quality:
    • Proper titles, labels, and captions
    • Colorblind-friendly palette
    • Appropriate themes
    • Clear legends
  3. A narrative:
    • 2-3 paragraph introduction explaining your question
    • Transition text between plots explaining what each shows
    • 2-3 paragraph conclusion summarizing findings
  4. Technical elements:
    • At least one faceted plot
    • At least one customized plot (colors, themes, labels)
    • Proper use of aesthetics (color, shape, size)

Example Questions to Explore

  • How has language use evolved across different genres over time?
  • Are there regional differences in writing styles?
  • What patterns exist in the data that might surprise a linguist?
  • Can you predict time period based on linguistic features?

Deliverables

  1. R Markdown document with all code and narrative
  2. 3-5 high-quality figures saved as PNG (300 dpi)
  3. One “highlight figure” that tells your main story

Evaluation Criteria

Your project will be strong if it:
- ✅ Chooses appropriate plot types for each question
- ✅ Uses visualization best practices (clear labels, readable fonts, etc.)
- ✅ Tells a coherent story with the data
- ✅ Shows technical mastery of ggplot2
- ✅ Includes thoughtful interpretation of results
- ✅ Is reproducible (all code runs without errors)

Bonus points for:
- Creative combinations of techniques
- Particularly insightful findings
- Exceptional visual design
- Going beyond the tutorial examples


Resources and Next Steps

Online Resources

Interactive Learning:
- R Graph Gallery - Hundreds of examples with code
- Data to Viz - Decision tree for choosing plots
- From Data to Viz - Interactive explorer

Reference:
- ggplot2 documentation
- R Color Reference
- ColorBrewer - Choose palettes

Advanced Topics:
- Patchwork - Combining multiple plots
- gganimate - Animated visualizations
- plotly - Interactive plots
- rayshader - 3D visualizations (when appropriate!)

Cheat Sheets

Download and print these:
- ggplot2 cheat sheet
- RStudio IDE cheat sheet

Common Problems and Solutions

“My plot is too crowded”

Solutions:
- Facet into multiple panels
- Filter to top N categories
- Use color to highlight key groups
- Try a different plot type (e.g., heatmap instead of scatter)

“Colors look different in different programs”

Solutions:
- Use colorblind-safe palettes
- Test in target environment
- Save as PDF (preserves colors better)
- Specify colors explicitly with hex codes

“Text overlaps in my plot”

Solutions:
- Rotate labels: theme(axis.text.x = element_text(angle = 45, hjust = 1))
- Use ggrepel::geom_text_repel()
- Reduce number of labels
- Increase plot size
- Abbreviate labels
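
For example, ggrepel (install with install.packages("ggrepel") if needed) nudges labels apart automatically, sketched here on the built-in mtcars data:

```r
library(ggplot2)

# Labels repel each other and the points instead of overlapping
ggplot(mtcars, aes(wt, mpg, label = rownames(mtcars))) +
  geom_point() +
  ggrepel::geom_text_repel(size = 3, max.overlaps = 15) +
  theme_minimal() +
  labs(x = "Weight (1,000 lbs)", y = "Miles per gallon")
```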

“Error: object not found”

Solutions:
- Check spelling of variable names
- Ensure data is loaded
- Check if library is loaded
- Use str(data) to see variable names

“Plot looks pixelated”

Solutions:
- Increase DPI: ggsave(..., dpi = 300)
- Save as PDF (vector format)
- Increase figure size
- Avoid resizing after saving

Where to Get Help

  1. Stack Overflow: Tag your question with [r] and [ggplot2]
  2. RStudio Community: https://community.rstudio.com/
  3. R for Data Science Slack: https://www.rfordatasci.com/
  4. Twitter #rstats: Active, helpful community

Practice Datasets

To continue learning, try these datasets:

Built into R:
- mpg - Fuel economy data
- diamonds - Diamond prices and properties
- economics - US economic time series
- midwest - Demographic data

From packages:
- gapminder - Global health and wealth
- nycflights13 - Flight data
- fivethirtyeight - Data from news articles
- palmerpenguins - Alternative to iris dataset

Your Learning Path

Beginner → Intermediate:
1. ✅ Master basic geoms (point, line, bar, box)
2. ✅ Understand aesthetics and mapping
3. ✅ Learn faceting
4. ✅ Customize themes
5. ⬜ Combine multiple plots (patchwork)
6. ⬜ Create custom themes
7. ⬜ Build functions for repeated plots

Intermediate → Advanced:
1. ⬜ Master scales and coordinates
2. ⬜ Custom annotations
3. ⬜ Statistical transformations
4. ⬜ Extension packages (gganimate, ggraph, etc.)
5. ⬜ Interactive visualizations (plotly)
6. ⬜ Creating your own geoms
7. ⬜ Publication-ready figure workflows


Citation & Session Info

Schweinberger, Martin. 2025. Mastering Data Visualization with R. Brisbane: The University of Queensland. url: https://ladal.edu.au/tutorials/dviz/dviz.html (Version 2025.02.07).

@manual{schweinberger2025dviz,
  author = {Schweinberger, Martin},
  title = {Mastering Data Visualization with R},
  note = {https://ladal.edu.au/tutorials/dviz/dviz.html},
  year = {2025},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2025.02.07}
}

Session Information

Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices datasets  utils     methods  
[8] base     

other attached packages:
 [1] viridis_0.6.5           viridisLite_0.4.2       quanteda.textplots_0.95
 [4] quanteda_4.2.0          scales_1.3.0            ggstats_0.10.0         
 [7] ggflags_0.0.4           ggstatsplot_0.13.0      EnvStats_3.0.0         
[10] gridExtra_2.3           vip_0.4.1               PMCMRplus_1.9.12       
[13] rstantools_2.4.0        hexbin_1.28.5           flextable_0.9.7        
[16] tidyr_1.3.1             ggridges_0.5.6          tm_0.7-16              
[19] NLP_0.3-2               vcd_1.4-13              likert_1.3.5           
[22] xtable_1.8-4            ggplot2_3.5.1           stringr_1.5.1          
[25] dplyr_1.1.4            

loaded via a namespace (and not attached):
  [1] rstudioapi_0.17.1       jsonlite_1.9.0          datawizard_1.0.0       
  [4] correlation_0.8.6       magrittr_2.0.3          TH.data_1.1-3          
  [7] estimability_1.5.1      SuppDists_1.1-9.8       farver_2.1.2           
 [10] rmarkdown_2.30          ragg_1.3.3              vctrs_0.6.5            
 [13] memoise_2.0.1           paletteer_1.6.0         askpass_1.2.1          
 [16] base64enc_0.1-6         effectsize_1.0.0        htmltools_0.5.9        
 [19] BWStest_0.2.3           Formula_1.2-5           htmlwidgets_1.6.4      
 [22] plyr_1.8.9              sandwich_3.1-1          emmeans_1.10.7         
 [25] zoo_1.8-13              cachem_1.1.0            uuid_1.2-1             
 [28] lifecycle_1.0.4         iterators_1.0.14        pkgconfig_2.0.3        
 [31] Matrix_1.7-2            R6_2.6.1                fastmap_1.2.0          
 [34] digest_0.6.39           colorspace_2.1-1        rematch2_2.1.2         
 [37] patchwork_1.3.0         textshaping_1.0.0       Hmisc_5.2-2            
 [40] labeling_0.4.3          compiler_4.4.2          fontquiver_0.2.1       
 [43] withr_3.0.2             backports_1.5.0         htmlTable_2.4.3        
 [46] psych_2.4.12            MASS_7.3-61             openssl_2.3.2          
 [49] tools_4.4.2             foreign_0.8-87          lmtest_0.9-40          
 [52] stopwords_2.3           zip_2.3.2               statsExpressions_1.6.2 
 [55] nnet_7.3-19             glue_1.8.0              nlme_3.1-166           
 [58] checkmate_2.3.2         cluster_2.1.6           reshape2_1.4.4         
 [61] generics_0.1.3          gtable_0.3.6            data.table_1.17.0      
 [64] xml2_1.3.6              foreach_1.5.2           pillar_1.10.1          
 [67] splines_4.4.2           lattice_0.22-6          renv_1.1.1             
 [70] survival_3.7-0          gmp_0.7-5               tidyselect_1.2.1       
 [73] fontLiberation_0.1.0    knitr_1.51              fontBitstreamVera_0.1.1
 [76] xfun_0.56               stringi_1.8.4           yaml_2.3.10            
 [79] evaluate_1.0.3          codetools_0.2-20        kSamples_1.2-10        
 [82] officer_0.6.7           gdtools_0.4.1           tibble_3.2.1           
 [85] multcompView_0.1-10     cli_3.6.4               RcppParallel_5.1.10    
 [88] rpart_4.1.23            parameters_0.24.1       systemfonts_1.2.1      
 [91] munsell_0.5.1           Rcpp_1.0.14             zeallot_0.1.0          
 [94] coda_0.19-4.1           parallel_4.4.2          bayestestR_0.15.2      
 [97] Rmpfr_1.0-0             mvtnorm_1.3-3           slam_0.1-55            
[100] insight_1.0.2           purrr_1.0.4             rlang_1.1.7            
[103] fastmatch_1.1-6         multcomp_1.4-28         mnormt_2.1.1           

Acknowledgments

This tutorial builds on the excellent work of the R and tidyverse communities. Special thanks to:

  • Hadley Wickham for creating ggplot2
  • The RStudio team for tools and resources
  • All package authors cited throughout
  • The LADAL team for supporting this tutorial



Quick Reference Tables

Common Geoms Reference

Geom                 Use For         Example
geom_point()         Scatter plots   Relationship between 2 continuous variables
geom_line()          Line graphs     Time series, trends
geom_bar()           Bar plots       Categorical frequencies
geom_boxplot()       Boxplots        Distribution summaries
geom_violin()        Violin plots    Distribution shapes
geom_histogram()     Histograms      Single-variable distributions
geom_density()       Density plots   Smooth distributions
geom_smooth()        Trend lines     Adding regression/smoothing
geom_errorbar()      Error bars      Showing uncertainty
geom_tile()          Heatmaps        Matrix visualizations
geom_hex()           Hex bins        Large scatter plots
geom_density_2d()    2D density      Concentration in 2D

Common Aesthetics

Aesthetic   Controls            Example Variables
x           X-axis position     Continuous or categorical
y           Y-axis position     Continuous or categorical
color       Border/line color   Groups, categories
fill        Fill color          Groups (for bars, boxes, etc.)
size        Point/line size     Continuous variables
shape       Point shape         Categories (max ~6)
alpha       Transparency        Continuous (0-1)
linetype    Line type           Categories

Common Themes

Theme             Description
theme_bw()        Black and white, minimal
theme_minimal()   Minimal theme, no background
theme_classic()   Classic look, axis lines
theme_void()      Empty theme
theme_dark()      Dark background
theme_grey()      Default ggplot2 theme

Position Adjustments

Position              Use For
position_dodge()      Side-by-side bars
position_stack()      Stacked bars/areas
position_fill()       100% stacked
position_jitter()     Avoid overplotting
position_identity()   Use exact values

Remember: The best visualization is the one that clearly communicates your message to your audience! 📊